Conversation

@yucai (Contributor) commented Oct 25, 2018

What changes were proposed in this pull request?

Refactor BenchmarkWideTable to use main method.
Generate benchmark result:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.WideTableBenchmark"

How was this patch tested?

manual tests

Member

`bin/spark-submit --class <this class> <spark sql test jar>` -> `bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>`

Member

It seems that this is not a good way, cc @dongjoon-hyun

Member

+1 for @wangyum's comment.
@yucai. In this case, we cannot call this [TEST] or Refactor.

Member

How about adding `val splitThreshold = SQLConf.get.getConfString("spark.testing.codegen.splitThreshold", "1024").toInt` to our "To run this benchmark" instructions?
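A minimal sketch of the pattern being suggested, with a plain `Map` standing in for `SQLConf` so the snippet is self-contained (the real call is `SQLConf.get.getConfString(key, default)` from `org.apache.spark.sql.internal.SQLConf`):

```scala
// Sketch only: read a test-only threshold from a configuration map,
// falling back to the default of 1024 when the key is unset.
object SplitThresholdSketch {
  // Stand-in for SQLConf.get.getConfString(key, default).
  def getConfString(conf: Map[String, String], key: String, default: String): String =
    conf.getOrElse(key, default)

  def main(args: Array[String]): Unit = {
    val conf = Map("spark.testing.codegen.splitThreshold" -> "2048")
    val splitThreshold =
      getConfString(conf, "spark.testing.codegen.splitThreshold", "1024").toInt
    println(splitThreshold)
  }
}
```

With the key set as above, `splitThreshold` is 2048; with an empty map it falls back to 1024.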

Contributor Author

@wangyum Thanks for the suggestion! You prefer modifying CodeGenerator.scala each time this benchmark is run, right? I feel it could be tricky to require users to modify the code, and if CodeGenerator.scala changes in the future, it will be hard to keep the document here up to date. @dongjoon-hyun @gengliangwang any suggestions?

Member
@dongjoon-hyun dongjoon-hyun Oct 26, 2018

In this case, we need advice from the right persons. :)

Member
@gengliangwang gengliangwang Oct 26, 2018

Personally I don't think this is a good solution. We should start a new discussion about whether to make it configurable in production as well.

Member

"10", "100", "1024", "8196", "65536" -> "10", "100", "1024", "2048", "4096", "8196", "65536"?

Member

BenchmarkWideTable -> WideTableBenchmark?

SparkQA commented Oct 25, 2018

Test build #98013 has finished for PR 22823 at commit f8bce47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yucai yucai force-pushed the BenchmarkWideTable branch from f8bce47 to b00396f on October 25, 2018 14:18
@yucai (Contributor Author) commented Oct 25, 2018

Thanks @wangyum for the good suggestion!

SparkQA commented Oct 25, 2018

Test build #98018 has finished for PR 22823 at commit b00396f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 25, 2018

Test build #98022 has finished for PR 22823 at commit d2da3a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yucai yucai changed the title [SPARK-25676][SQL][TEST] Refactor BenchmarkWideTable to use main method [SPARK-25676][SQL][TEST] Improve BenchmarkWideTable: use main method and automate Oct 26, 2018
@yucai yucai changed the title [SPARK-25676][SQL][TEST] Improve BenchmarkWideTable: use main method and automate [SPARK-25676][SQL][TEST] Improve BenchmarkWideTable: use main method and automation Oct 26, 2018
Member

Hi, @davies and @cloud-fan and @kiszk .

This benchmark was added in Spark 2.1.0. The value 1k was determined by manually changing the split threshold.

This PR wants to add a configuration in CodeGenerator.scala for testing purposes only.

  1. Is the configuration helpful in general purpose?
  2. If then, can we make another PR for that first?
  3. If not, is it allowed to add this testing parameter?

Contributor
@cloud-fan cloud-fan Oct 26, 2018

I think we should have a PR to add this config officially. It should be useful for performance tuning.

Member

Thank you for the decision, @cloud-fan !

@yucai . Please proceed to propose a new PR for only that new configuration (if you didn't start yet).

Contributor Author

@dongjoon-hyun I am working on #22847.

SparkQA commented Oct 29, 2018

Test build #98208 has finished for PR 22823 at commit 0cfac24.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

huaxz1986 pushed a commit to huaxz1986/spark that referenced this pull request Nov 5, 2018
…nction configurable

## What changes were proposed in this pull request?
As per the discussion in [apache#22823](https://github.com/apache/spark/pull/22823/files#r228400706), add a new configuration to make the split threshold for the code generated function configurable.

When the generated Java function source code exceeds `spark.sql.codegen.methodSplitThreshold`, it will be split into multiple small functions.

## How was this patch tested?
manual tests

Closes apache#22847 from yucai/splitThreshold.

Authored-by: yucai <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
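The commit message above says that generated Java functions exceeding `spark.sql.codegen.methodSplitThreshold` get split into multiple small functions. As a toy sketch of that idea only (not Spark's actual CodeGenerator.scala logic, which operates on generated Java source), one can chunk "statements" so no group exceeds the threshold:

```scala
// Toy illustration: group statements into function bodies so that no
// body exceeds the threshold (measured here in characters). A single
// statement longer than the threshold still gets its own group.
object MethodSplitSketch {
  def split(stmts: Seq[String], threshold: Int): Seq[Seq[String]] =
    stmts.foldLeft(Vector.empty[Vector[String]]) { (acc, s) =>
      acc.lastOption match {
        case Some(last) if last.map(_.length).sum + s.length <= threshold =>
          acc.init :+ (last :+ s) // still fits: extend the current group
        case _ =>
          acc :+ Vector(s)        // start a new group
      }
    }
}
```

For example, with a threshold of 4 characters, `split(Seq("aa", "bb", "cc"), 4)` yields two groups: `("aa", "bb")` and `("cc")`.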
@dongjoon-hyun (Member)

Hi, @yucai .
Since #22847 is merged, could you rebase this onto master again? Thanks!

@yucai yucai changed the title [SPARK-25676][SQL][TEST] Improve BenchmarkWideTable: use main method and automation [SPARK-25676][SQL][TEST] Refactor BenchmarkWideTable to use main method Nov 6, 2018
@yucai yucai force-pushed the BenchmarkWideTable branch from 0cfac24 to 83857d0 on November 6, 2018 02:14
@yucai (Contributor Author) commented Nov 6, 2018

@dongjoon-hyun Just pushed the rebased version, thanks!

@dongjoon-hyun (Member)

Thank you, @yucai . Could you update the title because we are renaming it? Maybe, [SPARK-25676][SQL][TEST] Rename and refactor BenchmarkWideTable to use main method?

@yucai yucai changed the title [SPARK-25676][SQL][TEST] Refactor BenchmarkWideTable to use main method [SPARK-25676][SQL][TEST] Rename and refactor BenchmarkWideTable to use main method Nov 6, 2018
SparkQA commented Nov 6, 2018

Test build #98501 has finished for PR 22823 at commit f714cc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Could you review and merge yucai#7, @yucai ?

SparkQA commented Nov 6, 2018

Test build #98509 has finished for PR 22823 at commit a755039.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

split threshold 10                          38932 / 39307          0.0       37128.1       1.0X
split threshold 100                         31991 / 32556          0.0       30508.8       1.2X
split threshold 1024                        10993 / 11041          0.1       10483.5       3.5X
split threshold 2048                          8959 / 8998          0.1        8543.8       4.3X
Contributor Author

@dongjoon-hyun On my Mac, in most cases, 2048 is the best.

Contributor Author

I tested it with OpenJDK (in a Linux VM); 2048 is also the best.

================================================================================================
projection on wide table
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.11.1.el7.x86_64
Intel Core Processor (Haswell)
projection on wide table:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
split threshold 10                          23995 / 25673          0.0       22883.7       1.0X
split threshold 100                         12881 / 13419          0.1       12284.3       1.9X
split threshold 1024                          6435 / 7402          0.2        6137.2       3.7X
split threshold 2048                          5861 / 6766          0.2        5589.2       4.1X
split threshold 4096                          6464 / 7825          0.2        6164.6       3.7X
split threshold 8192                          7886 / 8742          0.1        7520.7       3.0X
split threshold 65536                       46143 / 48029          0.0       44005.6       0.5X
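As a reading aid (an assumption inferred from the numbers above, not something stated in the thread), the Relative column appears to be the baseline row's best time divided by each row's best time:

```scala
// Assumption: Relative = baseline best time / this row's best time,
// where the baseline is the first row (split threshold 10).
object RelativeColumnSketch {
  // Best times (ms) from the OpenJDK table above, in row order
  // (thresholds 10, 100, 1024, 2048, 4096, 8192, 65536).
  val best: Seq[Double] = Seq(23995, 12881, 6435, 5861, 6464, 7886, 46143)
  val relative: Seq[Double] = best.map(best.head / _)
}
```

Here `relative(3)` comes out near 4.1, matching the `4.1X` printed for threshold 2048, and `relative(6)` near 0.5 for threshold 65536.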

Member

@yucai . I trust you. :) Don't worry about that. The scope of this PR is not for choosing the best option.

SparkQA commented Nov 6, 2018

Test build #98519 has finished for PR 22823 at commit 610fc31.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yucai (Contributor Author) commented Nov 6, 2018

retest this please

Member
@dongjoon-hyun dongjoon-hyun left a comment

@yucai , @cloud-fan , @gatorsmile , @gengliangwang .

This PR looks good to me. This refactoring is not for choosing the best default value for the configuration; that is beyond its scope.

The results on different machines (JVM/CPU/disk) are not the same. This PR clearly shows the benefit of @yucai's previous configuration PR, #22847. Users are able to run benchmarks (including this one) with various configurations on their own machines and workloads, and choose the best values for that configuration.

SparkQA commented Nov 6, 2018

Test build #98522 has finished for PR 22823 at commit 610fc31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Merged to master.

@asfgit asfgit closed this in 63ca4bb Nov 6, 2018
@dongjoon-hyun (Member)

Thank you all.

Seq("10", "100", "1024", "2048", "4096", "8192", "65536").foreach { n =>
  benchmark.addCase(s"split threshold $n", numIters = 5) { iter =>
    withSQLConf(SQLConf.CODEGEN_METHOD_SPLIT_THRESHOLD.key -> n) {
      df.selectExpr(columns: _*).foreach(identity(_))
Member

Hi, All.
It turns out that this breaks the Scala 2.12 build. I made a PR to fix that: #22970

Contributor Author

I see, thanks!

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…nction configurable

## What changes were proposed in this pull request?
As per the discussion in [apache#22823](https://github.com/apache/spark/pull/22823/files#r228400706), add a new configuration to make the split threshold for the code generated function configurable.

When the generated Java function source code exceeds `spark.sql.codegen.methodSplitThreshold`, it will be split into multiple small functions.

## How was this patch tested?
manual tests

Closes apache#22847 from yucai/splitThreshold.

Authored-by: yucai <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…e main method

## What changes were proposed in this pull request?

Refactor BenchmarkWideTable to use main method.
Generate benchmark result:

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.WideTableBenchmark"
```

## How was this patch tested?

manual tests

Closes apache#22823 from yucai/BenchmarkWideTable.

Lead-authored-by: yucai <[email protected]>
Co-authored-by: Yucai Yu <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>